XML-Enabled Data Extraction for Web Sources
Authors
Abstract
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often, interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats are not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, web users and applications need a smart way of extracting data from these web sources. One popular approach is to write wrappers around the sources, either manually or with software assistance, to bring the web data within the reach of more sophisticated query tools and general mediator-based information integration systems. In this paper, we describe the methodology and the software development of XWRAP, an XML-enabled wrapper construction system for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages are extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates the tasks of building wrappers that are specific to a Web source from the tasks that are repetitive for any source, and uses a component library to provide basic building blocks for wrapper programs. Second, it provides inductive learning algorithms that derive or discover wrapper patterns by reasoning about sample pages or sample specifications. Third and most importantly, we introduce and develop a two-phase code generation framework. The first phase utilizes an interactive interface facility to encode the source-specific metadata knowledge identified by individual wrapper developers as declarative information extraction rules. The second phase combines the information extraction rules generated in the first phase with the XWRAP component library to construct an executable wrapper program for the given web source. We report on the performance of XWRAP and on our experiments, demonstrating the benefit of building wrappers for a number of Web sources in different domains using the XWRAP generation system.
This research is partially supported by DARPA contract MDA972-97-1-0016 and a grant from Intel. An extended abstract of this paper appeared in ICDE 2000.
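The two-phase idea described in the abstract can be illustrated with a small Python sketch. This is not XWRAP's actual rule language or component library, neither of which is shown in the abstract; the rule format (`RULES`), the helper `build_wrapper`, the tag names, and the sample page are all hypothetical. The sketch only shows the general shape: declarative extraction rules are written down first, then combined with generic, source-independent code to produce an executable wrapper whose output carries the implicit metadata as explicit XML tags.

```python
import re
import xml.etree.ElementTree as ET

# Phase one (hypothetical output): declarative, source-specific extraction
# rules. Each rule names the XML tag to emit and a pattern locating the value.
RULES = [
    {"tag": "city", "pattern": r"<h1>(.*?)</h1>"},
    {"tag": "temperature", "pattern": r"Temp:\s*<b>(.*?)</b>"},
]


def build_wrapper(rules, root_tag="record"):
    """Phase two: combine the rules with generic code into a wrapper function."""
    compiled = [(rule["tag"], re.compile(rule["pattern"], re.S)) for rule in rules]

    def wrapper(html):
        # Generic, source-independent part: apply every rule and encode
        # each extracted value as an explicit XML element.
        root = ET.Element(root_tag)
        for tag, regex in compiled:
            match = regex.search(html)
            if match:
                ET.SubElement(root, tag).text = match.group(1).strip()
        return root

    return wrapper


# Hypothetical sample page for a made-up weather source.
page = "<html><h1>Atlanta</h1><p>Temp: <b>72F</b></p></html>"
wrap = build_wrapper(RULES)
doc = wrap(page)
xml_text = ET.tostring(doc, encoding="unicode")

# Content filtering is then performed against the XML, not the raw HTML.
cities = [element.text for element in doc.findall("city")]
```

Keeping the rules as plain data is what makes the split possible: supporting another source would, under these assumptions, mean writing a new rule list while `build_wrapper` stays unchanged.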
Similar resources
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats is not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or application...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
An XML-enabled data extraction toolkit for web sources
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats are not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or applicat...
Extracting semistructured data from the Web: An XQuery Based Approach
This paper describes work in progress concerning the extraction of information from the web. This work is part of a framework for extracting, interconnecting, and accessing heterogeneous data sources. In this paper, we present a new approach for information extraction from the web. In this approach, the web is viewed as a large database containing XML documents. The XQuery language is used in o...
Cognitive Agents for Automatic Generation of Valid XML Documents
The World Wide Web can be considered a huge collection of potentially interesting information sources, and because of the great amount of information available, users need ways to extract and summarize relevant data and present them in an appropriate format (see for a survey of tools helping the user in accessing data in the WWW). To improve effectiveness of information access and knowledge manage...